Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Vishal C V, Sathvik K B, Nischay N, Manoj Athreya H, Sagar N
DOI Link: https://doi.org/10.22214/ijraset.2021.39354
Statistics has always been an integral part of the sporting world. Selectors pick players based on numerous factors such as averages, strike-rates, runs scored or goals scored. Teams have exclusive ‘talent hunters’, who spend weeks, if not months, trying to uncover talent from different parts of the world. With the rise of the niche field of Sports Analytics, teams can now perform player evaluations on the large volumes of data now available. This paper aims to examine the factors that truly indicate the capacity of cricket players to perform at the top-most level – international cricket. Though this research has been carried out on cricket data, it is hoped that similar methods can be used to hunt for true talent in other sports!
I. INTRODUCTION
Strike-rates, averages, centuries, wickets and economy-rates form the crux of cricket statistics. Domestic cricket, just like international cricket, is played in three different forms – First-class cricket, List A cricket and T20 domestic. Which form of the game is most indicative? For a batter, which parameter matters more – strike-rate or average? Do a wrist-spinner’s statistics carry special weight in contrast to those of a finger-spinner? The purpose of this research was to find answers to many such questions. Some outcomes were as expected, whereas others were surprising. This paper discusses the findings, and how these findings can affect the future of cricket and sports analytics.
II. PROBLEM STATEMENT AND OBJECTIVES
As the title suggests, the main objective of this research was to determine the statistical importance of various numeric and categorical parameters and in turn, harness their predictive power to forecast the rate of success of a player. Exploratory Data Analysis has also been performed in an attempt to detect relationships between various parameters – something that Machine Learning may not always do. Some questions that have been answered are:
III. LITERATURE SURVEY
A. Satyam Mukherjee. (2012). Quantifying individual performance in Cricket − A network analysis of Batsmen and Bowlers.
This paper [1] provides a revised approach for determining the 'quality' of runs scored by a batter or wickets taken by a bowler. It looks at how Social Network Analysis (SNA) can be used to evaluate the effectiveness of team members. Using the player-vs-player information available for Test and ODI cricket, the author has created a directed and weighted network of batters-bowlers and also a network of batters and bowlers based on batters’ dismissal records throughout cricket's history. This method might be used to evaluate a player's performance in domestic contests, paving the path for a more balanced team selection for international matchups, but ours is a more streamlined approach for analysing the data.
B. Vipul Punjabi, Rohit Chaudhari, Devendra Pal, Kunal Nhavi, Nikhil Shimpi, Harshal Joshi. (2019). A survey on team selection in game of cricket using machine learning.
This study [2] tries to predict player performance, such as how many runs each batter will score and how many wickets each bowler will take for both teams. Both issues are characterised as classification problems, with the number of runs and wickets falling into separate ranges. The paper focuses more on the venue and how the players are likely to perform there. More data would be needed to reach firmer conclusions, and further analysis of domestic cricket, as carried out in our work, would be helpful.
C. Saikia, Hemanta & Bhattacharjee, Dibyojyoti & Krishnan, Unni. (2016). A New Model for Player Selection in Cricket. International Journal of Performance Analysis in Sport.
The paper [3] offers a metric that can be used to convert a cricketer's performance into a single numerical value that can be used to calculate the player's cricketing efficiency. The distributional pattern of the performance metric is determined and then used to determine the best performers in various fields of expertise. The selectors' task is made easier as a result of the exercise because they now have a reduced set of options to choose from. This paper relies on several formulas to calculate performance, which cannot be applied uniformly to all players because various other factors need to be considered.
D. Subramanian Rama Iyer, Ramesh Sharda. (2009). Prediction of athlete’s performance using neural networks: An application in cricket team selection.
The paper [4] uses neural networks to forecast each cricketer's future results based on their previous performance and then divides them into three categories: performers, moderates, and failures. Based on the rating that the player has received, the model recommends whether the player should be included in the squad or not. Our paper takes more data into consideration and gives further analysis of the players by also taking their domestic stats into account.
IV. RESEARCH METHODOLOGY
Data on over 500 players was web-scraped from sites such as ESPNCricinfo and Cricbuzz using Beautiful Soup, a popular Python library for web scraping.
These players were first stratified into four categories – wicket-keeper batters, batters, bowlers and all-rounders. Fewer than 10% of the players were wicket-keeper batters.
Also, analysis suggested that no wicket-keeper with poor batting ability has been successful. So, wicket-keeper batters were categorized as batters, making the total number of categories three. All analysis and model building were performed on the categories individually, since each category comprised different parameters. Batting-oriented parameters such as batting strike-rate and number of runs were dropped for bowlers.
Economy rates and bowling averages were dropped for batters. To ensure the significance of any outcome or result, players who had played fewer than 40 international matches were dropped from each of the categories. The final dataset consisted of data on 148 batters, 128 bowlers and 85 all-rounders.
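The stratification and filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, field names (`role`, `intl_matches`) and sample players are hypothetical, and only the 40-match threshold and the merging of wicket-keeper batters into batters come from the text.

```python
from collections import Counter

MIN_INTERNATIONAL_MATCHES = 40  # threshold stated in the text

# Hypothetical records; field names and values are illustrative only.
players = [
    {"name": "Player A", "role": "wicket-keeper batter", "intl_matches": 120},
    {"name": "Player B", "role": "batter", "intl_matches": 35},
    {"name": "Player C", "role": "bowler", "intl_matches": 64},
    {"name": "Player D", "role": "all-rounder", "intl_matches": 40},
]


def prepare(records, min_matches=MIN_INTERNATIONAL_MATCHES):
    """Merge wicket-keeper batters into batters and drop low-sample players."""
    kept = []
    for record in records:
        if record["intl_matches"] < min_matches:
            continue  # fewer than the minimum number of internationals
        role = record["role"]
        if role == "wicket-keeper batter":
            role = "batter"  # wicket-keeper batters are treated as batters
        kept.append({**record, "role": role})
    return kept


prepared = prepare(players)
counts = Counter(player["role"] for player in prepared)
print(counts)
```

Each resulting category would then get its own data-frame, with the columns that do not apply to it (for example, bowling averages for batters) dropped.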
A. Data Pre-processing and Analysis
After the classification of data, null values had to be handled. T20 and franchise cricket are relatively new to the sport. From the inception of T20 cricket in 2004 up to 2010, fewer than 130 T20 International matches had been played. By 2020, however, more than 1000 T20 matches had been played. As a result, many top cricketers have never played T20 cricket. This is a case of MNAR (missing not at random). To replace null values in the T20 average and strike-rate columns, the Iterative Imputer was used. Sklearn’s Iterative Imputer models each feature with null values as a function of the other features, in turn computing values that replace the null values.
Multiple imputation ensures that all features have been taken into consideration in the computation of a value of the given variable. In computing T20 strike-rates, considering the year of debut becomes as important as any other substantial factor. The majority of players who debuted before 2007 have strike-rates under 80, whereas the majority of players who debuted after 2007 have strike-rates over 80. One of the main reasons for this change in trend is the introduction of T20 cricket in 2006, which changed the face of the game. Not only did batters start scoring at a high SR in T20 cricket, but this change was also seen in ODIs as well as in the domestic equivalent of ODIs, List A cricket.
T20 averages and strike rates shared a roughly linear relationship with other parameters such as first-class and List A averages and strike-rates. Hence, the Bayesian Ridge form of multiple imputation was used.
Bayesian Ridge is a version of linear regression wherein point estimates are replaced by probability distributions (i.e. y is not a single fixed output, but is rather drawn from a Gaussian distribution).
After multiple imputation of missing data, Exploratory Data Analysis was carried out, the results of which are discussed in the next section.
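The imputation step can be sketched with scikit-learn's `IterativeImputer` using `BayesianRidge` as the per-feature estimator. This is a minimal sketch under assumptions: the column layout and all numbers below are invented for illustration, and the settings (`max_iter`, `random_state`) are not taken from the paper.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Hypothetical columns: first-class average, List A average, List A strike-rate,
# T20 average, T20 strike-rate. Values are made up for illustration.
X = np.array([
    [45.2, 38.1, 82.0, 27.5, 121.0],
    [39.8, 35.4, 78.5, 24.0, 115.0],
    [51.3, 41.0, 85.5, np.nan, np.nan],  # a player with no T20 record
    [33.0, 30.2, 90.1, 22.8, 130.5],
    [42.7, 36.9, 80.3, 25.9, 118.2],
    [36.5, 32.8, 88.0, 23.1, 127.4],
])

# Each column with missing values is regressed on the other columns,
# and the missing entries are replaced by the regression's predictions.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed[2])  # the row whose T20 columns were filled in
```

Observed entries pass through unchanged; only the null cells are replaced.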
In order to prepare the data for model building, it was scaled using the standard scaler, given the symmetric distribution of data.
Figure 4: The standard scaler is computed by subtracting the mean of all observations from x and dividing the result by the standard deviation of all observations.
Models were fit on each of the three data-frames, in an attempt to accurately predict the success rate of an international cricketer. The number of player-of-the-match awards as a percentage of international matches played was taken as the metric for success rate. The target variable, F(x), was this man-of-the-match percentage, referred to throughout this paper as ‘motm_perc’.
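Written out, the scaling described in Figure 4 takes the form below. The symbols z, μ, σ and n are introduced here for illustration; the 1/n form of the standard deviation is an assumption that matches scikit-learn's StandardScaler, which the text appears to refer to.

```latex
z = \frac{x - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \mu\right)^2}
```

As a hypothetical example, a batting average of x = 45 in a column with mean μ = 38 and standard deviation σ = 7 is scaled to z = (45 − 38) / 7 = 1.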
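The success metric defined above is a simple ratio. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def motm_perc(motm_awards, intl_matches):
    """Player-of-the-match awards as a percentage of international matches played."""
    if intl_matches <= 0:
        raise ValueError("intl_matches must be positive")
    return 100.0 * motm_awards / intl_matches


# Hypothetical player: 6 awards in 120 international matches.
example_value = motm_perc(6, 120)
print(example_value)
```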
The next section is divided into four parts – Impact of multiple imputation, Exploratory Data Analysis, Models’ performance before Principal Component Analysis, and Models’ performance after Principal Component Analysis.
V. RESULTS AND INTERPRETATION
A. Impact of multiple imputation
The foremost question after the pre-processing of data was ‘How well did the iterative imputer perform?’ The only way to judge is by examining the computed values. Given below is the data for Indian Cricketer VVS Laxman. VVS Laxman played no T20 matches before his international debut. Hence, his T20 strike-rate and average are null.
Had VVS played a decent bit of T20 cricket before his international debut, the imputed values suggest he would have averaged around 27 and struck at 119. Now, how believable are these numbers? Ask cricket fans from the late 90s and early 2000s, and they will tell you it is quite believable! This shows the strength of statistical methods such as multiple imputation. When enough original data is available, it generally makes sense to prefer multiple imputation over simple imputation.
B. Exploratory Data Analysis
Figure 7 exhibits the upward trend of List A strike-rates as time progressed. But what effect did these have on the ‘motm_perc’?
Figure 7 shows that for batters who debuted before 2007, the average 4 to 8 motm_perc mark was in the 70-85 strike-rate range. Figure 8 shows that this motm_perc mark moved about 15 strike-rate units to the right! That is incredible! This indicates that benchmark 50-over cricket scores went up by about 45 runs in the post-T20 era, since 15 extra runs per 100 balls over a 300-ball innings amounts to 45 runs.
Enough about batters! Here is something for the bowling fans.
Figures 9 and 10, together, clearly indicate that the mean and median economy rates started climbing with time. But what is interesting is that post 2006, teams started to prefer bowlers with lower strike rates (bowlers who could pick up wickets more frequently). This means that teams didn’t mind bowling expensive bowlers, as long as they could bowl out the opposition quickly.
Figure 11 describes the relationship between the golden ratio of cricket and motm_perc. The golden ratio is the ratio between the batting average and bowling average of an all-rounder. It is a general belief that a ratio closer to 1 is indicative of a better all-rounder. The graph above seems to validate the belief!
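The golden ratio defined above can be computed directly. The sketch below ranks a few hypothetical all-rounders by how close their ratio is to 1, following the belief stated in the text; the names and averages are invented.

```python
def golden_ratio(batting_average, bowling_average):
    """Ratio between an all-rounder's batting average and bowling average."""
    if bowling_average <= 0:
        raise ValueError("bowling_average must be positive")
    return batting_average / bowling_average


# Hypothetical all-rounders: (name, batting average, bowling average).
all_rounders = [
    ("C", 20.0, 40.0),
    ("A", 30.0, 30.0),
    ("B", 36.0, 30.0),
]

# Sort by distance of the ratio from 1 (closest first).
ranked = sorted(all_rounders, key=lambda p: abs(golden_ratio(p[1], p[2]) - 1.0))
ranked_names = [name for name, _, _ in ranked]
print(ranked_names)
```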
C. Before PCA
As the figure below suggests, motm_perc shared no strong linear relationship with other parameters.
As expected, statsmodels’ GLM did not yield favorable results even after filtering independent variables based on p-values and variance inflation factors (VIF). Though the R-squared score was nearly 70% on the training data, it fell to a feeble 23% when the fitted linear regression model was evaluated on the testing data. Similar problems persisted for all three data-frames – Batters, Bowlers and All-rounders.
Feature Engineering, Recursive Feature Elimination and penalizing the linear model (Lasso and Ridge regression) did not mitigate the overfitting. So, a Random Forest model was tried on the data.
Mean Absolute Percentage Error is similar to MAE, but in percentage form.
GridSearchCV identified that the best-performing model had an NMAE of -1.829 and a MAPE of 65.43% on the test data. The NMAE suggests that the predicted motm_perc was, on average, off by 1.8, which might not seem too bad. In simpler terms, the model may predict that a batter will win you 10 in 100 matches when he might actually win you anywhere between 8 and 12 matches. However, the MAPE suggests that the error rate is over 60%, which may not be that bad for this sort of scenario, but can be better.
Before applying Principal Component Analysis (PCA) on the data-frame, the important features and the extent of their importance were visualized using Random Forest’s feature_importances_ attribute.
The importance of features is calculated based on node impurities. The importance of a node is computed by weighting the decrease in impurity at that node by the probability of a tree reaching that node. The higher the resulting value for the nodes that split on a feature, the higher the importance of that feature.
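The two error metrics discussed above can be written out in a few lines. This is a plain-Python sketch with made-up motm_perc values, not the paper's data. The negative sign on the reported NMAE is presumably due to scikit-learn's `neg_mean_absolute_error` scorer, which negates MAE so that larger scores are better; that reading is an assumption, since the paper does not expand the abbreviation.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error as a fraction of the actual value.

    Undefined when an actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical motm_perc values for four players.
actual = [10.0, 5.0, 8.0, 4.0]
predicted = [12.0, 4.0, 8.0, 5.0]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
neg_mae = -mae  # the sign convention used by a "negated MAE" score

print(mae, mape, neg_mae)
```

Because motm_perc values are small, an absolute error of one or two percentage points can translate into a large percentage error, which is consistent with a modest MAE sitting alongside a high MAPE.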
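A minimal sketch of reading impurity-based importances from a fitted scikit-learn Random Forest follows. The data is synthetic and the feature names are hypothetical; the target is constructed so that the first feature dominates, purely to make the ranking visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature names; the data below is synthetic.
feature_names = ["fc_average", "lista_strike_rate", "t20_average"]
X = rng.normal(size=(200, 3))
# Synthetic target driven mostly by the first feature.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the normalized impurity-based importances.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction attributed to that feature.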
It is clear that some features are more important than others, in predicting the international success of the batters. Coaches are probably right - a good batter invariably does well in first-class cricket!
D. After PCA
PCA was applied on the data-frame. PCA transforms the dataset onto a lower-dimensional subspace. The aim is to absorb as much information as possible from as few parameters as possible. Linear transformations cause a change in the values of the data.
PCA suggested that 12 components captured more than 90% of the variance in the data.
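Selecting the number of components by explained variance can be sketched with scikit-learn as below. The data is synthetic (150 rows, 20 correlated features built from 5 latent factors), so the number of components it yields says nothing about the paper's dataset; only the 90% threshold comes from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a player data-frame: correlated features
# generated from a small number of latent factors plus noise.
latent = rng.normal(size=(150, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + rng.normal(scale=0.3, size=(150, 20))

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

The reduced matrix `X_reduced` would then replace the original features as input to the Random Forest.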
On fitting a Random Forest model on this data, the results significantly improved. MAPE of test data was now only 22%. Observe the predicted values before and after PCA.
A similar approach was directed towards the bowler and all-rounder data-frames.
Figure 22 suggests that all-rounders are valued most for their experience, followed by batting skills, bowling economy and strike-rate. In fact, a player with a lower economy-rate is in more demand!
TABLE I – Positive effect of PCA
Category (in increasing order of dimensionality) | Random Forest MAPE before PCA | Random Forest MAPE after PCA
Batters | 0.64 | 0.22
Bowlers | 0.71 | 0.38
All-rounders | 0.91 | 0.41
The table above exemplifies the positive effect of PCA, especially when working with high dimensionality datasets. Reduction in dimensionality and capturing of relevant information is the key to obtaining stronger models.
Currently, Sports Analytics is sparse in India, but it is definitely the future. The introduction of the T20 format has definitely changed the face of cricket. With limited time and resources, we were able to predict the success rate of a player in international cricket just by using the player’s domestic stats. Imagine what specialists could do with extensive data such as ball-by-ball and match-by-match stats, the performance of other players in the same match, venue of the matches, weather conditions and various other factors. Such is the power of Sports Analytics. Not only in cricket, Analytics could do wonders in other sports as well.
[1] Satyam Mukherjee, Quantifying individual performance in Cricket – A network analysis of batsmen and bowlers, Physica A: Statistical Mechanics and its Applications, Volume 393, 2014, Pages 624-637, ISSN 0378-4371, https://doi.org/10.1016/j.physa.2013.09.027.
[2] Vipul Punjabi, Rohit Chaudhari, Devendra Pal, Kunal Nhavi, Nikhil Shimpi, Harshal Joshi, A survey on team selection in game of cricket using machine learning, International Research Journal of Engineering and Technology, 2019.
[3] Hemanta Saikia, Dibyojyoti Bhattacharjee, Unni Krishnan Radhakrishnan, A New Model for Player Selection in Cricket, International Journal of Performance Analysis in Sport, 16:1, 2016, Pages 373-388, DOI: 10.1080/24748668.2016.11868893.
[4] Subramanian Rama Iyer, Ramesh Sharda, Prediction of athlete’s performance using neural networks: An application in cricket team selection, Expert Systems with Applications, Volume 36, Issue 3, Part 1, 2009, Pages 5510-5522, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2008.06.088.
Copyright © 2022 Vishal C V, Sathvik K B, Nischay N, Manoj Athreya H, Sagar N. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET39354
Publish Date : 2021-12-09
ISSN : 2321-9653
Publisher Name : IJRASET